In [1]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
import numpy as np
import pandas as pd
import networkx as nx
from tethne.readers import wos
import igraph
import nltk
from collections import Counter
from tethne import Corpus
from helpers import extract_keywords, filter_token, normalize_token

2.1. Co-citation Analysis

In this workbook we will conduct a co-citation analysis using the approach outlined in Chen (2009). If you have used the Java-based desktop application CiteSpace II, this should be familiar: this is the same methodology that is implemented in that application.

Co-citation analysis gained popularity in the 1970s as a technique for “mapping” scientific literatures, and for finding latent semantic relationships among technical publications.

Two papers are co-cited if they are both cited by the same, third, paper. The standard approach to co-citation analysis is to generate a sample of bibliographic records from a particular field by using certain keywords or journal names, and then build a co-citation graph describing relationships among their cited references. Thus the majority of papers that are represented as nodes in the co-citation graph are not papers that responded to the selection criteria used to build the dataset.
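
The definition above can be illustrated with a minimal sketch. The paper and reference names below are hypothetical toy data, not drawn from the Web of Science records used in this notebook, and the counting is done with the standard library rather than Tethne.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing paper maps to the references it cites.
citing_papers = {
    'paper_A': ['ref_1', 'ref_2', 'ref_3'],
    'paper_B': ['ref_1', 'ref_2'],
    'paper_C': ['ref_2', 'ref_3'],
}

cocitation_counts = Counter()
for references in citing_papers.values():
    # Every unordered pair of references in one bibliography
    # counts as one co-citation of that pair.
    for pair in combinations(sorted(set(references)), 2):
        cocitation_counts[pair] += 1

for pair, count in sorted(cocitation_counts.items()):
    print('%s -- %s: %i' % (pair[0], pair[1], count))
```

Note that the nodes of the resulting graph are the cited references (ref_1, ref_2, ref_3), not the citing papers that were used to build it.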

Our objective in this tutorial is to identify papers that bridge the gap between otherwise disparate areas of knowledge in the scientific literature. We rely on the theoretical framework described in Chen (2006) and Chen et al. (2009).

According to Chen, we can detect potentially transformative changes in scientific knowledge by looking for cited references that both (a) rapidly accrue citations, and (b) have high betweenness centrality in a co-citation network. It helps to think of each scientific paper as representing a “concept” (its core knowledge claim, perhaps), and of a co-citation event as representing a proposition connecting two concepts in the knowledge base of a scientific field. If a new paper emerges that is highly co-cited with two otherwise distinct clusters of concepts, then that might mean that the field is adopting new concepts and propositions in a way that is structurally radical for its conceptual framework.

Chen (2009) introduces sigma ($\Sigma$) as a metric for potentially transformative cited references:

$$ \Sigma(v) = (g(v) + 1)^{burstness(v)} $$

...where the betweenness centrality of each node v is:

$$ g(v) = \sum\limits_{i\neq j\neq v} \frac{\sigma_{ij} (v)}{\sigma_{ij}} $$

...where $\sigma_{ij}$ is the number of shortest paths from node i to node j and $\sigma_{ij}(v)$ is the number of those paths that pass through v. Burstness (normalized to the range 0 to 1) is estimated using Kleinberg’s (2002) automaton model, which is designed to detect rate spikes around features in a stream of documents.
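
To make the definition of g(v) concrete, here is a minimal pure-Python sketch of Brandes' algorithm for an undirected, unweighted graph. The toy graph (two triangles joined through a single bridging node `d`) and the function name `betweenness` are illustrative assumptions. The sketch returns raw counts of node pairs, whereas NetworkX's `betweenness_centrality` rescales its scores by default, so the magnitudes will differ from those reported later in this notebook.

```python
from collections import deque

def betweenness(adjacency):
    """Unnormalized betweenness for an undirected, unweighted graph."""
    scores = dict.fromkeys(adjacency, 0.0)
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {v: [] for v in adjacency}
        n_paths = dict.fromkeys(adjacency, 0)
        n_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        # Accumulate path fractions from the farthest nodes back to source.
        dependency = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in predecessors[w]:
                share = n_paths[v] / float(n_paths[w])
                dependency[v] += share * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Each unordered pair (i, j) was visited from both endpoints.
    return {v: s / 2.0 for v, s in scores.items()}

# Hypothetical graph: clusters {a, b, c} and {e, f, g} bridged by d.
edges = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'd'),
         ('d', 'e'), ('e', 'f'), ('e', 'g'), ('f', 'g')]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

g = betweenness(adjacency)
```

In this toy graph every shortest path between the two clusters passes through `d`, which is the structural signature of a bridging reference that the tutorial is looking for.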


Note: In this notebook we will not use burstness, but rather the relative increase/decrease in citations from one year to the next. Burstness is helpful when we are dealing with higher-resolution time-frames, and/or we want to monitor a long stream of citation data. Since we will smooth our data with a multi-year time-window, burstness becomes a bit less informative, and the year-over-year change in citations (we'll call this Delta $\Delta$) is an intuitive alternative.


Load data

Here we have some field-tagged data from the Web of Science. We set streaming=True so that we don't load everything into memory all at once.


Note: When we stream the corpus, it is important to set index_fields and index_features ahead of time, so that we don't have to iterate over the whole corpus later on.



In [43]:
metadata = wos.read('../data/Baldwin/PlantPhysiology', 
                    streaming=True, index_fields=['date', 'abstract'], index_features=['citations'])

In [44]:
len(metadata)


Out[44]:
7849

Co-citation graph

Tethne provides a function called cocitation that creates co-citation graphs.


In [5]:
from tethne import cocitation

Co-citation graphs can get enormous quickly, and so it is important to set a threshold number of times that a paper must be cited to be included in the graph (min_weight). It's better to start high, and bring the threshold down as needed.

Note that the fields named in edge_attrs should be among those that we listed in index_fields when we used read() (above). Here we keep only date.


In [6]:
graph = cocitation(metadata, min_weight=6., edge_attrs=['date'])

In [7]:
graph.order(), graph.size(), nx.number_connected_components(graph)


Out[7]:
(6537, 23121, 206)

Serialize

We can write our graph to a GraphML file and open it in Cytoscape to get a sense of its structure.


In [8]:
nx.write_graphml(graph, 'cocitation.graphml')

Sigma, $\Sigma$

Chen (2009) proposed sigma ($\Sigma$) as a metric for potential transformations in a scientific literature.

$$ \Sigma(v) = (g(v) + 1)^{burstness(v)} $$

Note: As explained above, in this notebook we replace burstness with the year-over-year change in citations, $\Delta(v)$. So:

$$ \Sigma(v) = (g(v) + 1)^{\Delta(v)} $$

$$ \Delta(v) = \frac{N_t(v) - N_{t-1}(v)}{\max(1, N_{t-1}(v))} $$

...where $N_t(v)$ is the number of citations of v in year t.
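
As a worked example of these two formulas, the sketch below computes Delta and Sigma for a single reference. The function names `delta` and `sigma` and the input values are hypothetical and chosen only for illustration.

```python
def delta(n_now, n_prev):
    """Relative year-over-year change in citations."""
    return (n_now - n_prev) / max(1.0, float(n_prev))

def sigma(centrality, change):
    """Sigma, with Delta used in place of burstness."""
    return (centrality + 1.0) ** change

# Hypothetical reference: uncited last year, cited 18 times this year,
# with a betweenness centrality of 0.165.
change = delta(18, 0)
score = sigma(0.165, change)
print(change)
print(score)
```

The `max(1, ...)` in the denominator avoids division by zero for references that were not cited in the previous year. It also means that a reference going from 0 to N citations receives a Delta of N.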

GraphCollection

Since we are interested in the evolution of the co-citation graph over time, we need to create a series of sequential graphs. Tethne provides a class called GraphCollection that will do this for us.

We pass metadata (a Corpus object), the cocitation function, and then some configuration information:

  • slice_kwargs controls how the sequential time-slices are generated. The default is to use 1-year slices, and advance 1 year per slice. We are stating here that we want to extract only the citations feature from each slice (for performance).
  • method_kwargs controls the graph-building function. Here we pass our min_weight, and we also say that we don't want any attributes on the edges (for performance).
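
The idea of building one graph per time-slice can be sketched without Tethne. The records and the function `sliced_cocitation` below are hypothetical. Applying `min_weight` to the co-citation count of each pair is an assumption of this sketch, and Tethne's own thresholding may work differently.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (publication year, cited references).
records = [
    (1999, ['ref_1', 'ref_2']),
    (1999, ['ref_1', 'ref_2', 'ref_3']),
    (2000, ['ref_2', 'ref_3']),
    (2000, ['ref_2', 'ref_3']),
    (2000, ['ref_1', 'ref_3']),
]

def sliced_cocitation(records, min_weight=1):
    """Count co-cited pairs separately for each one-year slice."""
    by_year = defaultdict(Counter)
    for year, references in records:
        for pair in combinations(sorted(set(references)), 2):
            by_year[year][pair] += 1
    # Keep only the pairs that reach the threshold within their slice.
    return {
        year: {pair: n for pair, n in counts.items() if n >= min_weight}
        for year, counts in by_year.items()
    }

slices = sliced_cocitation(records, min_weight=2)
```

Because the threshold is applied within each slice, a pair that is co-cited often over the whole period can still be absent from any single year's graph.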

In [9]:
from tethne import GraphCollection
G = GraphCollection(metadata, cocitation,
                    slice_kwargs={'feature_name': 'citations'}, 
                    method_kwargs={'min_weight': 3, 'edge_attrs': []})

In [12]:
for year, graph in G.iteritems():
    print graph.order(), graph.size(), nx.number_connected_components(graph)
    nx.write_graphml(graph, 'cocitation_%i.graphml' % year)


406 1019 57
516 1257 60
682 1536 77
626 1336 67
917 2144 82
834 2241 91
627 1154 97
609 1657 65
660 1515 79
868 2215 68
792 1818 86
774 1657 83
839 2227 81
845 2549 71
391 638 67

Betweenness centrality

Recall that the betweenness centrality of each node v is:

$$ g(v) = \sum\limits_{i\neq j\neq v} \frac{\sigma_{ij} (v)}{\sigma_{ij}} $$

We can analyze all of the nodes in all of the graphs in our GraphCollection using its analyze() method.


In [13]:
# 'betweenness_centrality' is the name of the algorithm
#  in NetworkX that we want to use. ``invert=True`` means
#  that we want to organize the g(v) values by node, rather
#  than by time-period (the default).
g_v = G.analyze('betweenness_centrality', invert=True)

In [26]:
g_v.items()[89]


Out[26]:
(89,
 {1999: 0.012212559880326648,
  2000: 0.1451754690517866,
  2001: 0.11777211863697992,
  2002: 0.024040269161038387,
  2003: 0.012573209158147235,
  2004: 0.05753114988769691,
  2005: 0.002120708428951243,
  2006: 0.026932066769497706,
  2007: 0.025856622517020283,
  2008: 0.03452387454835741,
  2009: 0.011951828484182447,
  2010: 0.00529130551685833,
  2011: 0.008302012907130655,
  2012: 0.00036514767024372035,
  2013: 4.394348867356579e-06})
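
The reorganization that `invert=True` asks for can be pictured as swapping the two levels of a nested dictionary. The following sketch uses hypothetical values and a hypothetical helper named `invert`. It is not Tethne's implementation.

```python
# Hypothetical per-year results: {year: {node: centrality}}.
by_year = {
    1999: {'ref_1': 0.10, 'ref_2': 0.00},
    2000: {'ref_1': 0.25, 'ref_3': 0.05},
}

def invert(nested):
    """Turn {outer: {inner: value}} into {inner: {outer: value}}."""
    inverted = {}
    for outer_key, inner in nested.items():
        for inner_key, value in inner.items():
            inverted.setdefault(inner_key, {})[outer_key] = value
    return inverted

by_node = invert(by_year)
```

A node that is absent from a given year's graph simply has no entry for that year, which is why the later code reads values with `.get(year, 0.0)`.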

Organize our data

In order to calculate $\Sigma$ more efficiently, we'll organize our data about the graph in a DataFrame. Our DataFrame will have the following columns:

  • ID - The GraphCollection gives each node an integer ID. We'll need this to identify nodes in Cytoscape, later.
  • Node - This is the Author Year Journal label from the WoS data for a cited reference.
  • Year - The year of the graph that the row describes.
  • Citations - The number of citations in that year.
  • Centrality - Betweenness centrality in that year.
  • Delta - The year-over-year increase/decrease in citations.

This may take a bit, depending on the size of the graphs.


In [23]:
node_data = pd.DataFrame(columns=['ID', 'Node', 'Year', 'Citations', 'Centrality', 'Delta'])

i = 0
for n in G.node_index.keys():
    if n < 0:
        continue
        
    # node_history() gets the values of a node attribute over
    #  all of the graphs in the GraphCollection.
    g_n = G.node_history(n, 'betweenness_centrality')
    N_n = G.node_history(n, 'count')
    
    # Skip nodes whose g(v) never gets above 0.
    if max(g_n.values()) == 0:
        continue
    
    years = sorted(G.keys())   # Graphs are keyed by year.
    for year in years:    
        g_nt = g_n.get(year, 0.0)    # Centrality for this year.
        N_nt = float(N_n.get(year, 0.0))    # Citations for this year.
        # For the second year and beyond, calculate Delta.
        if year > years[0]:    
            N_nlast = N_n.get(year-1, 0.0)
            delta = (N_nt - N_nlast)/max(N_nlast, 1.)
        else:    
            delta = 0.0
            
        # We will add one row per node per year.
        node_data.loc[i] = [n, G.node_index[n], year, N_nt, g_nt, delta]
        i += 1
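
Assigning to `node_data.loc[i]` one row at a time can be slow for large graphs. One possible alternative, sketched here with a hypothetical helper `yearly_rows` and made-up values, is to collect plain tuples first and build the table in a single step at the end.

```python
def yearly_rows(node_id, label, centrality, citations, years):
    """Build one (ID, Node, Year, Citations, Centrality, Delta) row per year."""
    rows = []
    for position, year in enumerate(years):
        n_now = float(citations.get(year, 0.0))
        if position == 0:
            # No previous year to compare against.
            change = 0.0
        else:
            n_prev = float(citations.get(year - 1, 0.0))
            change = (n_now - n_prev) / max(n_prev, 1.0)
        rows.append((node_id, label, year, n_now,
                     centrality.get(year, 0.0), change))
    return rows

# Hypothetical node with two years of centrality and three of citations.
rows = yearly_rows(
    89, 'HYPOTHETICAL_A_2000_PLANT_CELL',
    centrality={1999: 0.01, 2000: 0.15},
    citations={1999: 4, 2000: 6, 2001: 3},
    years=[1999, 2000, 2001],
)
```

The accumulated list could then be passed once to `pd.DataFrame(rows, columns=[...])` with the same column names as above.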

That was fairly computationally expensive. We should save the results so that we don't have to do that again.


In [29]:
node_data.to_csv('node_data.csv', encoding='utf-8')

Before calculating $\Sigma$ we will select a subset of the rows in our DataFrame, to reduce the computational burden. Here we create a smaller DataFrame with only those rows in which both Centrality and Delta are greater than zero. The excluded rows cannot rank highly: if either value is zero then $\Sigma$ is exactly 1, and a negative Delta gives $\Sigma$ below 1.


In [30]:
# Note the ``.copy()`` at the end -- this means that the new DataFrame will be a
#  stand-alone copy, and not just a "view" of the existing ``node_data`` DataFrame.
#  The practical effect is that we can add new data to the new ``candidates``
#  DataFrame without creating problems in the larger ``node_data`` DataFrame.
candidates = node_data[node_data.Centrality*node_data.Delta > 0.].copy()

Now we calculate $\Sigma$. Vector math is great!


In [31]:
candidates['Sigma'] = (1.+candidates.Centrality)**candidates.Delta
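
The filtering and ranking steps can also be followed on a handful of rows without pandas. The rows below are hypothetical, and the comments mark why each excluded row could not have ranked highly.

```python
# Hypothetical rows standing in for node_data.
rows = [
    {'Node': 'ref_1', 'Centrality': 0.16, 'Delta': 18.0},
    {'Node': 'ref_2', 'Centrality': 0.00, 'Delta': 5.0},   # Sigma would be 1
    {'Node': 'ref_3', 'Centrality': 0.05, 'Delta': 0.0},   # Sigma would be 1
    {'Node': 'ref_4', 'Centrality': 0.05, 'Delta': -0.5},  # Sigma below 1
    {'Node': 'ref_5', 'Centrality': 0.04, 'Delta': 10.0},
]

# Same condition as the DataFrame filter: the product must be positive.
candidates = [dict(r) for r in rows if r['Centrality'] * r['Delta'] > 0.0]

for r in candidates:
    r['Sigma'] = (1.0 + r['Centrality']) ** r['Delta']

ranked = sorted(candidates, key=lambda r: r['Sigma'], reverse=True)
```

Copying each row with `dict(r)` plays the same role as `.copy()` in the pandas version: adding the Sigma value does not modify the original rows.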

The nodes (in a given year) with the highest $\Sigma$ are our candidates for potential "transformations". Note that in Chen's model, the cited reference and its co-cited references are an emission of the "knowledge" of the field. In other words, the nodes and edges in our graph primarily say something about the records that are doing the citing, rather than the records that are cited. That is, a node with high $\Sigma$ is indicative of a transformation, but that transformation is a property of the papers that cite it and not necessarily of the paper that the node represents.


In [32]:
candidates.sort('Sigma', ascending=False)


Out[32]:
ID Node Year Citations Centrality Delta Sigma
29141 5060 BROWN_DM_2005_PLANT_CELL 2010 18 1.648243e-01 18.000000 15.584316
10849 1768 EISEN_MB_1998_P_NATL_ACAD_SCI_USA 2003 12 1.404663e-01 12.000000 4.841608
13774 2284 ULMASOV_T_1997_PLANT_CELL 2003 11 1.216092e-01 11.000000 3.533924
5792 895 KAUL_S_2000_NATURE 2001 32 3.894427e-02 32.000000 3.395878
33433 6039 PIETERSE_CMJ_2009_NAT_CHEM_BIOL 2012 15 4.728827e-02 15.000000 1.999832
2761 411 BUSH_DS_1995_ANNU_REV_PLANT_PHYS 2000 10 6.533943e-02 10.000000 1.883129
25764 4413 RUSSINOVA_E_2004_PLANT_CELL 2008 9 7.181309e-02 9.000000 1.866687
20094 3441 BENJAMINI_Y_1995_J_ROY_STAT_SOC_B_MET 2008 19 3.334857e-02 19.000000 1.865048
14855 2476 SHI_HZ_2002_PLANT_CELL 2004 9 6.633559e-02 9.000000 1.782564
16132 2728 KWAK_JM_2003_EMBO_J 2006 11 5.320022e-02 11.000000 1.768562
25254 4327 DETTMER_J_2006_PLANT_CELL 2008 10 5.690549e-02 10.000000 1.739248
19827 3398 VANCE_CP_2003_NEW_PHYTOL 2011 13 4.181121e-02 13.000000 1.703167
26785 4635 RAMAKERS_C_2003_NEUROSCI_LETT 2009 10 5.368633e-02 10.000000 1.686994
9303 1526 MALECK_K_2000_NAT_GENET 2002 10 5.168910e-02 10.000000 1.655289
33598 6071 MBENGUE_M_2010_PLANT_CELL 2012 7 7.077737e-02 7.000000 1.613966
10138 1654 XIE_DX_1998_SCIENCE 2012 10 4.609769e-02 10.000000 1.569359
27445 4755 OLDROYD_GED_2008_ANNU_REV_PLANT_BIOL 2009 9 5.030106e-02 9.000000 1.555336
15710 2646 OLSZEWSKI_N_2002_PLANT_CELL 2004 9 4.971979e-02 9.000000 1.547606
34439 6313 MORTAZAVI_A_2008_NAT_METHODS 2013 9 4.847406e-02 9.000000 1.531155
11464 1883 FOLTA_KM_2001_PLANT_J 2003 16 2.667175e-02 16.000000 1.523722
493 69 GUZMAN_P_1990_PLANT_CELL 2012 10 4.254831e-02 10.000000 1.516917
15080 2516 SANDERS_D_2002_PLANT_CELL 2004 10 4.206943e-02 10.000000 1.509964
19389 3328 LIVAK_KJ_2001_METHODS 2008 25 1.580790e-02 25.000000 1.480095
19107 3287 MISSON_J_2005_P_NATL_ACAD_SCI_USA 2011 11 3.591751e-02 11.000000 1.474270
12859 2135 RASHOTTE_AM_2001_PLANT_CELL 2003 8 4.871931e-02 8.000000 1.463100
29471 5135 BOERJAN_W_2003_ANNU_REV_PLANT_BIOL 2010 10 3.749098e-02 10.000000 1.444918
9920 1622 BECHTOLD_N_1998_METH_MOL_B 2004 9 4.083697e-02 9.000000 1.433654
29411 5128 COSGROVE_DJ_2005_NAT_REV_MOL_CELL_BIO 2010 16 2.242590e-02 16.000000 1.425967
26052 4464 BAENA-GONZALEZ_E_2007_NATURE 2011 9 3.924726e-02 9.000000 1.414067
16730 2821 WILLIAMS_DC_1998_BIOCHEMISTRY 2004 7 5.045831e-02 7.000000 1.411405
... ... ... ... ... ... ... ...
6576 1036 KASUGA_M_1999_NAT_BIOTECHNOL 2005 7 8.093717e-06 0.166667 1.000001
11271 1855 QUACKENBUSH_J_2001_NUCLEIC_ACIDS_RES 2005 6 2.555911e-06 0.500000 1.000001
5511 848 MIRONOV_V_1999_PLANT_CELL 2005 4 3.669557e-06 0.333333 1.000001
12740 2100 MUSSIG_C_2002_PLANT_PHYSIOL 2004 4 3.634696e-06 0.333333 1.000001
22167 3789 SANCHEZ-CALDERON_L_2005_PLANT_CELL_PHYSIOL 2011 5 2.193397e-07 5.000000 1.000001
23128 3980 PERAGINE_A_2004_GENE_DEV 2012 4 3.279493e-06 0.333333 1.000001
2002 289 GAFFNEY_T_1993_SCIENCE 2006 6 5.419232e-06 0.200000 1.000001
8435 1369 CLARK_KL_1998_P_NATL_ACAD_SCI_USA 2004 4 2.404808e-07 4.000000 1.000001
23563 4041 BAUD_S_2007_PLANT_J 2012 6 4.216492e-06 0.200000 1.000001
5944 919 THOMMA_BPHJ_1998_P_NATL_ACAD_SCI_USA 2003 6 3.877634e-06 0.200000 1.000001
7763 1245 MURPHY_A_2000_PLANTA 2007 4 2.306156e-06 0.333333 1.000001
32247 5709 DREW_MC_1975_NEW_PHYTOL 2011 4 1.900944e-07 4.000000 1.000001
16811 2837 GUO_FQ_2003_SCIENCE 2010 5 1.117151e-06 0.666667 1.000001
17602 3019 SMITH_SM_2004_PLANT_PHYSIOL 2006 4 2.167693e-06 0.333333 1.000001
4172 626 CLEMENS_S_1999_EMBO_J 2001 4 2.159454e-06 0.333333 1.000001
4472 669 VATAMANIUK_OK_1999_P_NATL_ACAD_SCI_USA 2001 4 2.159454e-06 0.333333 1.000001
22977 3939 MA_Z_2003_PLANT_PHYSIOL 2011 4 1.782135e-07 4.000000 1.000001
27463 4757 FOO_E_2007_PLANT_PHYSIOL 2012 4 1.653526e-07 4.000000 1.000001
9409 1543 VISION_TJ_2000_SCIENCE 2003 7 2.982795e-06 0.166667 1.000000
15493 2588 RUUSKA_SA_2002_PLANT_CELL 2012 5 1.873996e-06 0.250000 1.000000
14353 2386 FOCKS_N_1998_PLANT_PHYSIOL 2012 5 1.873996e-06 0.250000 1.000000
31393 5521 JAKOBY_M_2002_TRENDS_PLANT_SCI 2012 5 1.873996e-06 0.250000 1.000000
15748 2648 HUANG_YF_2003_PLANT_J 2012 4 1.124398e-06 0.333333 1.000000
12221 2028 JOHANSON_U_2000_SCIENCE 2010 5 4.787791e-07 0.666667 1.000000
27431 4754 UMEHARA_M_2008_NATURE 2010 7 1.675727e-06 0.166667 1.000000
27596 4780 GOMEZ-ROLDAN_V_2008_NATURE 2010 7 1.675727e-06 0.166667 1.000000
2544 375 KONCZ_C_1986_MOL_GEN_GENET 2008 9 9.132848e-07 0.285714 1.000000
2161 313 FANKHAUSER_C_1997_ANNU_REV_CELL_DEV_BI 2000 7 1.388867e-06 0.166667 1.000000
32428 5760 STIRNBERG_P_2007_PLANT_J 2012 5 1.756871e-07 0.666667 1.000000
8869 1447 YU_J_2002_SCIENCE 2003 10 7.385969e-07 0.111111 1.000000

2934 rows × 7 columns

Cluster labeling


In [24]:
clusters = pd.read_csv('clusters.dat', sep='\t', skiprows=9)

In [25]:
clusters


Out[25]:
Cluster Score (Density*#Nodes) Nodes Edges Node IDs
0 1 22.083 25 265 FUJIOKA_S_1997_NAT_PROD_REP, NOMURA_T_1997_PLA...
1 2 13.333 16 100 AUKERMAN_MJ_1997_PLANT_CELL, QUAIL_PH_1995_SCI...
2 3 12.889 19 116 BASU_U_1994_J_PLANT_PHYSIOL, SIVAGURU_M_1998_P...
3 4 12.615 14 82 KIEBER_JJ_1993_CELL, WILKINSON_JQ_1997_NAT_BIO...
4 5 10.400 11 52 MCQUEENMASON_S_1994_P_NATL_ACAD_SCI_USA, BRUMM...
5 6 10.000 10 45 GRIFFITH_M_1992_PLANT_PHYSIOL, KROL_M_1984_CAN...
6 7 9.800 11 49 SENTENAC_H_1992_SCIENCE, SCHACHTMAN_DP_1994_NA...
7 8 9.400 11 47 NI_BR_1993_PLANT_PHYSIOL, GROOT_SPC_1992_PLANT...
8 9 9.111 10 41 PERRY_SE_1994_PLANT_CELL, HIRSCH_S_1994_SCIENC...
9 10 9.000 9 36 POTIKHA_TS_1999_PLANT_PHYSIOL, WINGE_P_1997_PL...
10 11 8.750 9 35 MOUILLE_G_1996_PLANT_CELL, HARRIS_EH_1989_CHLA...
11 12 8.545 12 47 MAUREL_C_1993_EMBO_J, WEIG_A_1997_PLANT_PHYSIO...
12 13 8.000 8 28 CHIANG_HH_1995_PLANT_CELL, PHILLIPS_AL_1995_PL...
13 14 7.556 10 34 LAMB_C_1997_ANNU_REV_PLANT_PHYS, KELLER_T_1998...
14 15 7.143 8 25 FELLE_HH_1996_PLANT_J, SCHULTZE_M_1992_P_NATL_...
15 16 7.143 8 25 VATAMANIUK_OK_1999_P_NATL_ACAD_SCI_USA, ZHU_YL...
16 17 6.857 8 24 PITTS_RJ_1998_PLANT_J, WADA_T_1997_SCIENCE, GA...
17 18 6.444 10 29 BLUME_B_1997_PLANT_J, KENDE_H_1993_ANNU_REV_PL...
18 19 6.333 7 19 CAMPBELL_MM_1996_PLANT_PHYSIOL, OSAKABE_K_1999...
19 20 6.000 6 15 ANGENENT_GC_1994_PLANT_J, PNUELI_L_1994_PLANT_...
20 21 6.000 6 15 SIEDOW_JN_1995_PLANT_CELL, VANLERBERGHE_GC_199...
21 22 6.000 7 18 SAXENA_IM_1995_J_BACTERIOL, TURNER_SR_1997_PLA...
22 23 5.833 13 35 JANG_JC_1994_PLANT_CELL, SMEEKENS_S_1997_PLANT...
23 24 5.600 6 14 CHANEY_RL_1972_PLANT_PHYSIOL, LANDSBERG_EC_198...
24 25 5.600 6 14 UTSUNO_K_1998_PLANT_CELL_PHYSIOL, MULLER_A_199...
25 26 5.273 12 29 JIANG_C_1996_PLANT_MOL_BIOL, BOHNERT_HJ_1995_P...
26 27 5.000 5 10 STONE_JM_1994_SCIENCE, WILLIAMS_RW_1997_P_NATL...
27 28 5.000 5 10 SAKAKIBARA_H_1991_J_BIOL_CHEM, GREGERSON_RG_19...
28 29 5.000 5 10 GODDIJN_OJM_1997_PLANT_PHYSIOL, VOGEL_G_1998_P...
29 30 4.700 21 47 NOCTOR_G_1998_ANNU_REV_PLANT_PHYS, ALLAN_AC_19...
... ... ... ... ... ...
51 52 3.333 4 5 ANDERBERG_RJ_1992_P_NATL_ACAD_SCI_USA, LEUNG_J...
52 53 3.333 4 5 PFANNSCHMIDT_T_1999_NATURE, DANON_A_1994_SCIEN...
53 54 3.333 4 5 WU_K_1997_PLANT_PHYSIOL, FERL_RJ_1996_ANNU_REV...
54 55 3.200 6 8 IYER_S_1998_PLANT_PHYSIOL, SMIRNOFF_N_1989_PHY...
55 56 3.200 6 8 CRAWFORD_NM_1998_TRENDS_PLANT_SCI, RAWAT_SR_19...
56 57 3.000 3 3 HUIJSER_P_1992_EMBO_J, THEISSEN_G_1996_J_MOL_E...
57 58 3.000 3 3 TAKEDA_Y_1993_CARBOHYD_RES, BHATTACHARYYA_MK_1...
58 59 3.000 3 3 TAKABAYASHI_J_1996_TRENDS_PLANT_SCI, ALBORN_HT...
59 60 3.000 3 3 SACK_FD_1991_INT_REV_CYTOL, LEGUE_V_1997_PLANT...
60 61 3.000 3 3 PRASAD_TK_1994_PLANT_CELL, MORITA_S_1999_PLANT...
61 62 3.000 3 3 JEFFERSON_R_1987_PLANT_MOL_BIOL_REP, SAMBROOK_...
62 63 3.000 3 3 EDELMAN_J_1968_NEW_PHYTOL, VANDENENDE_W_1996_P...
63 64 3.000 3 3 FRANKLINTONG_VE_1996_PLANT_CELL, ALEXANDRE_J_1...
64 65 3.000 3 3 NORMANLY_J_1997_PHYSIOL_PLANTARUM, BARTEL_B_19...
65 66 3.000 3 3 SUBBAIAH_CC_1994_PLANT_PHYSIOL, DOLFERUS_R_199...
66 67 3.000 3 3 WELLER_JL_1994_PLANTA, LOPEZJUEZ_E_1995_PLANT_...
67 68 3.000 3 3 LI_JY_1993_PLANT_CELL, LANDRY_LG_1995_PLANT_PH...
68 69 3.000 3 3 HARMS_K_1995_PLANT_CELL, LAUDERT_D_1998_PLANT_...
69 70 3.000 3 3 CHEN_ZY_1997_PLANT_PHYSIOL, ERIKSSON_M_1996_P_...
70 71 3.000 3 3 RENAUDIN_JP_1996_PLANT_MOL_BIOL, HUNTLEY_R_199...
71 72 3.000 3 3 HALFTER_U_2000_P_NATL_ACAD_SCI_USA, SHI_HZ_200...
72 73 3.000 3 3 BROCHETTOBRAGA_MR_1992_PLANT_PHYSIOL, GALILI_G...
73 74 3.000 3 3 MIZUTANI_M_1997_PLANT_PHYSIOL, BELLLELONG_DA_1...
74 75 3.000 3 3 UGGLA_C_1998_PLANT_PHYSIOL, TUOMINEN_H_1997_PL...
75 76 3.000 3 3 GEIGENBERGER_P_1997_PLANTA, TRETHEWEY_RN_1998_...
76 77 3.000 3 3 ELBOROUGH_KM_1996_BIOCHEM_J, CHOI_JK_1995_PLAN...
77 78 3.000 3 3 HASE_T_1991_PLANT_PHYSIOL, KIMATA_Y_1989_PLANT...
78 79 3.000 3 3 MARTINOIA_E_1993_NATURE, ALFENITO_MR_1998_PLAN...
79 80 3.000 3 3 KRONZUCKER_HJ_1995_PLANTA, INGEMARSSON_B_1987_...
80 81 3.000 3 3 KARDAILSKY_I_1999_SCIENCE, KOORNNEEF_M_1991_MO...

81 rows × 5 columns


In [33]:
clusters['Node IDs'][0].split(', ')


Out[33]:
['FUJIOKA_S_1997_NAT_PROD_REP',
 'NOMURA_T_1997_PLANT_PHYSIOL',
 'FUJIOKA_S_1997_PLANT_CELL',
 'CLOUSE_SD_1999_BRASSINOSTEROIDS',
 'SZEKERES_M_1996_CELL',
 'CHOI_YH_1997_PHYTOCHEMISTRY',
 'KAUSCHMANN_A_1996_PLANT_J',
 'AZPIROZ_R_1998_PLANT_CELL',
 'BISHOP_GJ_1996_PLANT_CELL',
 'LI_JM_1997_CELL',
 'NOGUCHI_T_1999_PLANT_PHYSIOL',
 'YOKOTA_T_1997_TRENDS_PLANT_SCI',
 'FUJIOKA_S_1997_PHYSIOL_PLANTARUM',
 'CHOE_SW_1999_PLANT_CELL',
 'CHOE_S_1999_PLANT_PHYSIOL',
 'FUJIOKA_S_1996_PLANT_CELL_PHYSIOL',
 'BISHOP_GJ_1999_P_NATL_ACAD_SCI_USA',
 'LI_JM_1996_SCIENCE',
 'KLAHRE_U_1998_PLANT_CELL',
 'CLOUSE_SD_1998_ANNU_REV_PLANT_PHYS',
 'CHOE_SW_1998_PLANT_CELL',
 'FUJIOKA_S_1995_BIOSCI_BIOTECH_BIOCH',
 'NOMURA_T_1999_PLANT_PHYSIOL',
 'TAKAHASHI_T_1995_GENE_DEV',
 'CLOUSE_SD_1996_PLANT_PHYSIOL']

In [34]:
citations = metadata.features['citations']

In [35]:
citing = Counter()
for reference in clusters['Node IDs'][0].split(', '):
    for idx in citations.papers_containing(reference):
        citing[idx] += 1.
chunk = [idx for idx, value in citing.items() if value > 2.]
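
The cell above selects the papers that cite more than two of the references in a cluster. The same selection can be sketched with a plain dictionary standing in for the citations feature. The index `papers_citing`, the helper `papers_for_cluster`, and all identifiers below are hypothetical.

```python
from collections import Counter

# Hypothetical inverted index: cited reference -> papers that cite it.
papers_citing = {
    'ref_1': ['p1', 'p2'],
    'ref_2': ['p1', 'p2', 'p3'],
    'ref_3': ['p1', 'p3'],
    'ref_4': ['p1'],
}

def papers_for_cluster(cluster_refs, papers_citing, min_shared=3):
    """Return papers citing at least min_shared references in the cluster."""
    hits = Counter()
    for reference in cluster_refs:
        for paper in papers_citing.get(reference, ()):
            hits[paper] += 1
    return sorted(p for p, n in hits.items() if n >= min_shared)

chunk = papers_for_cluster(['ref_1', 'ref_2', 'ref_3'], papers_citing)
```

Requiring several shared references, rather than one, keeps papers that only touch the cluster in passing out of the set that is later used to label it.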

This step can take a few minutes.


In [48]:
abstracts = {}
for abstract, wosid in metadata.indices['abstract'].iteritems():
    print '\r', wosid[0],
    abstracts[wosid[0]] = abstract


WOS:000185974800027

In [49]:
abstracts.items()[5]


Out[49]:
(u'WOS:000283710300030',
 u'Actinorhizal symbioses are mutualistic interactions between plants and the soil bacteria Frankia that lead to the formation of nitrogen-fixing root nodules. Little is known about the signaling mechanisms controlling the different steps of the establishment of the symbiosis. The plant hormone auxin has been suggested to play a role. Here we report that auxin accumulates within Frankia-infected cells in actinorhizal nodules of Casuarina glauca. Using a combination of computational modeling and experimental approaches, we establish that this localized auxin accumulation is driven by the cell-specific expression of auxin transporters and by Frankia auxin biosynthesis in planta. Our results indicate that the plant actively restricts auxin accumulation to Frankia-infected cells during the symbiotic interaction.')

In [85]:
document_token_counts = nltk.ConditionalFreqDist([
    (wosid, normalize_token(token))
    for wosid, abstract in abstracts.items()
    for token in nltk.word_tokenize(abstract)
    if filter_token(token)
])

In [91]:
extract_keywords(document_token_counts, lambda k: k in chunk)


Out[91]:
array([u'br', u'brassinosteroid', u'dwarf', u'brassinolide',
       u'brassinazole', u'dpy', u'ikb', u'campestanol', u'campesterol',
       u'tracheary', u'biosynthetic', u'characterization', u'five',
       u'deficient', u'ser', u'late', u'castasterone', u'ika', u'sterol',
       u'dwarfism'], 
      dtype='<U19')

In [93]:
cluster_keywords = {}
for i, row in clusters.iterrows():
    citing = Counter()
    for reference in row['Node IDs'].split(', '):
        for idx in citations.papers_containing(reference):
            citing[idx] += 1.
    chunk = [idx for idx, value in citing.items() if value > 2.]     
    
    cluster_keywords[row.Cluster] = extract_keywords(document_token_counts, lambda k: k in chunk)

In [94]:
cluster_keywords


Out[94]:
{1: array([u'br', u'brassinosteroid', u'dwarf', u'brassinolide',
        u'brassinazole', u'dpy', u'ikb', u'campestanol', u'campesterol',
        u'tracheary', u'biosynthetic', u'characterization', u'five',
        u'deficient', u'ser', u'late', u'castasterone', u'ika', u'sterol',
        u'dwarfism'], 
       dtype='<U19'),
 2: array([u'phya', u'phyb', u'phytochrome', u'light', u'irradiation',
        u'pulse', u'bvr', u'flowering', u'destruction', u'phytochromobilin',
        u'response', u'lfy', u'intermittent', u'shade', u'swelling',
        u'eafl', u'hir', u'red', u'camv', u'peeled'], 
       dtype='<U19'),
 3: array([u'dtz', u'citrate', u'efflux', u'malate', u'root', u'anion',
        u'aluminum', u'organic', u'callose', u'across', u'secretion',
        u'var', u'signalgrass', u'border', u'peroxidation', u'exudation',
        u'triticale', u'rootlet', u'solution', u'ruzigrass'], 
       dtype='<U19'),
 4: array([u'ethylene', u'receptor', u'ac', u'epi', u'ostmk', u'adh',
        u'insensitivity', u'pollination', u'virulent', u'avirulent',
        u'oskapp', u'cutting', u'pleiotropic', u'dominant', u'tomato',
        u'ripening', u'melo', u'mode', u'gravicurvature', u'following'], 
       dtype='<U19'),
 5: array([u'expansin', u'expansins', u'wall', u'strawberry', u'beta', u'pine',
        u'weakening', u'rheology', u'loblolly', u'fruit', u'extension',
        u'mrna', u'ripe', u'tomato', u'ripening', u'terminal', u'primary',
        u'detected', u'low', u'internode'], 
       dtype='<U19'),
 6: array([u'antifreeze', u'rye', u'winter', u'cold', u'clp', u'afp',
        u'chitinases', u'tlp', u'ice', u'individual', u'similar',
        u'exposed', u'protein', u'may', u'induced', u'two', u'apoplastic',
        u'polypeptide', u'complex', u'form'], 
       dtype='<U19'),
 7: array([u'cation', u'cng', u'camaldulensis', u'echkt', u'functional',
        u'uptake', u'ion', u'channel', u'transporter', u'permeability',
        u'component', u'starvation', u'contribution', u'cyclic', u'family',
        u'oocyte', u'potassium', u'member', u'current', u'wheat'], 
       dtype='<U19'),
 8: array([u'cap', u'radicle', u'emergence', u'endosperm', u'seed',
        u'weakening', u'chitinase', u'germination', u'mrna', u'tomato',
        u'expressed', u'tissue', u'beta', u'lateral', u'low', u'water',
        u'increased', u'embryo', u'wall', u'enzyme'], 
       dtype='<U19'),
 9: array([u'import', u'sstp', u'pfdiii', u'gtp', u'envelope', u'transit',
        u'dnak', u'affinity', u'precursor', u'hpl', u'possible',
        u'characteristic', u'ctp', u'imported', u'analog', u'stroma',
        u'guanosine', u'homologs', u'chloroplast', u'component'], 
       dtype='<U19'),
 10: array([u'rop', u'crib', u'gap', u'dynamic', u'gtpase', u'rep', u'network',
        u'pop', u'ropgaps', u'rho', u'transformed', u'rac', u'ro', u'burst',
        u'vacuole', u'stimulus', u'oxidative', u'soybean', u'tonoplast',
        u'family'], 
       dtype='<U19'),
 11: array([u'isoamylase', u'amylopectin', u'phytoglycogen', u'starch',
        u'pullulanase', u'polysaccharide', u'dextrinase', u'material',
        u'limit', u'granule', u'outer', u'oligosaccharide', u'mutation',
        u'dbe', u'germinated', u'among', u'transfer', u'enzyme', u'gbssi',
        u'cause'], 
       dtype='<U19'),
 12: array([u'intrinsic', u'mips', u'boron', u'distal', u'enters', u'based',
        u'nomenclature', u'aquaporins', u'highly', u'aqps', u'atmips',
        u'rewetting', u'subfamily', u'pip', u'soil', u'tonoplast',
        u'permeability', u'conductivity', u'membrane', u'conserved'], 
       dtype='<U19'),
 13: array([u'pumpkin', u'silique', u'sln', u'slender', u'lettuce', u'inactive',
        u'pistil', u'parthenocarpic', u'potato', u'night', u'etiolated',
        u'application', u'normal', u'transfer', u'mol', u'product', u'show',
        u'regulation', u'transgene', u'developing'], 
       dtype='<U19'),
 14: array([u'ro', u'rac', u'generation', u'superoxide', u'tmv', u'physiol',
        u'band', u'algr', u'aos', u'native', u'phox', u'incompatible',
        u'attacked', u'elicitins', u'scopoletin', u'effective', u'elicit',
        u'dpi', u'abolished', u'formazan'], 
       dtype='<U19'),
 15: array([u'nfs', u'glcnac', u'nod', u'factor', u'meliloti', u'oligomers',
        u'pge', u'leguminosarum', u'responded', u'substituents',
        u'microsome', u'end', u'thread', u'viciae', u'store',
        u'modification', u'chitin', u'cytosolic', u'proteolysis', u'free'], 
       dtype='<U19'),
 16: array([u'glutathione', u'pc', u'arsenate', u'azuki', u'target',
        u'tolerance', u'higher', u'antibody', u'bean', u'synthetase',
        u'level', u'study', u'gsh', u'plant', u'concentration', u'uptake',
        u'gene', u'accumulation', u'activity', u'root'], 
       dtype='<U19'),
 17: array([u'hair', u'xet', u'cutting', u'root', u'action', u'adventitious',
        u'localized', u'iron', u'tip', u'initiation', u'formation',
        u'transfer', u'deficiency', u'reductase', u'hormone', u'petunia',
        u'regulated', u'seedling', u'wall', u'increase'], 
       dtype='<U19'),
 18: array([u'ethylene', u'ac', u'acc', u'adh', u'carnation', u'pollination',
        u'banana', u'apple', u'ripening', u'cofactor', u'greatly', u'licl',
        u'distinct', u'following', u'pistil', u'flower', u'senescence',
        u'orchid', u'hypoxia', u'petal'], 
       dtype='<U19'),
 19: array([u'comt', u'ccoaomt', u'lignin', u'ccr', u'parent', u'populus',
        u'ferulate', u'ccoaomts', u'diploid', u'sinapine', u'recently',
        u'transformed', u'repression', u'brassica', u'antisense', u'poplar',
        u'napu', u'double', u'class', u'transformants'], 
       dtype='<U19'),
 20: array([u'later', u'tsam', u'floral', u'mads', u'node', u'partner',
        u'flower', u'residue', u'orchid', u'transition', u'interaction',
        u'box', u'subfamily', u'expressed', u'development', u'gene',
        u'process', u'meristem', u'stage', u'showed'], 
       dtype='<U19'),
 21: array([u'aox', u'alternative', u'respiration', u'ucp', u'mung', u'peaked',
        u'pyruvate', u'correlation', u'mango', u'main', u'regulatory',
        u'cm', u'ubiquinone', u'uncoupling', u'oxidation', u'gly', u'stage',
        u'mitochondrion', u'upon', u'respiratory'], 
       dtype='<U19'),
 22: array([u'cellulose', u'cga', u'cyanobacteria', u'collection', u'dcb',
        u'synthases', u'glucan', u'amplified', u'prediction', u'cesa',
        u'fragment', u'kor', u'single', u'hair', u'putative', u'mass',
        u'xyloglucan', u'fiber', u'also', u'delta'], 
       dtype='<U19'),
 23: array([u'glc', u'man', u'sugar', u'din', u'fru', u'hxk', u'whorl', u'tre',
        u'disaccharide', u'hexose', u'fructan', u'repress', u'moiety',
        u'hexokinase', u'stimulated', u'extent', u'sensing',
        u'coordinately', u'rsip', u'balance'], 
       dtype='<U19'),
 24: array([u'deficiency', u'iron', u'fluid', u'space', u'apoplastic', u'part',
        u'sap', u'several', u'transfer', u'protoplast', u'xylem', u'root',
        u'reductase', u'organic', u'large', u'hormone', u'mesophyll',
        u'hair', u'regulated', u'respiration'], 
       dtype='<U19'),
 25: array([u'csi', u'polar', u'iba', u'auxin', u'npa', u'vein', u'toward',
        u'gravity', u'iaa', u'inflorescence', u'blocked', u'basipetal',
        u'agravitropic', u'bending', u'polarity', u'transport',
        u'inhibitor', u'gravitropic', u'root', u'fiber'], 
       dtype='<U19'),
 26: array([u'antifreeze', u'cold', u'stz', u'cryoprotectin', u'freezing',
        u'cbf', u'overexpressing', u'samt', u'tolerance', u'rye',
        u'antiport', u'cryoprotective', u'chitinases', u'environmental',
        u'adult', u'dioxygenase', u'acclimate', u'cor', u'transcriptional',
        u'salt'], 
       dtype='<U19'),
 27: array([u'ostmk', u'oskapp', u'ser', u'named', u'five', u'kinase',
        u'residue', u'peptide', u'petunia', u'receptor', u'identified',
        u'rice', u'domain', u'cdna', u'protein', u'cell', u'plant', u'gene',
        u'mutant', u'activity'], 
       dtype='<U19'),
 28: array([u'oka', u'alfalfa', u'exodermis', u'infected', u'symbiotic',
        u'epidermis', u'parenchyma', u'nodule', u'zone', u'phosphatase',
        u'nitrogen', u'accumulation', u'mrna', u'plastid', u'induced',
        u'expression', u'rice', u'cell', u'protein', u'root'], 
       dtype='<U19'),
 29: array([u'trehalose', u'trehalase', u'lepidophylla', u'tre', u'fructan',
        u'agpase', u'lacking', u'stimulated', u'nkat', u'suc', u'starch',
        u'presence', u'effect', u'mutant', u'induction', u'high',
        u'biosynthesis', u'delta', u'induced', u'increased'], 
       dtype='<U19'),
 30: array([u'ro', u'asc', u'peroxidase', u'burst', u'isr', u'ig', u'aphid',
        u'cellulase', u'rld', u'puget', u'tmv', u'roi', u'lincoln',
        u'carrying', u'nae', u'pathogen', u'papilla', u'avirulent',
        u'virulent', u'psg'], 
       dtype='<U19'),
 31: array([u'cation', u'cng', u'channel', u'magnitude', u'functional',
        u'guard', u'permeability', u'mesophyll', u'oocyte', u'contribution',
        u'cyclic', u'family', u'opening', u'transporter', u'member',
        u'animal', u'nucleotide', u'current', u'aba', u'study'], 
       dtype='<U19'),
 32: array([u'distal', u'transference', u'rewetting', u'water', u'soil',
        u'tonoplast', u'conductivity', u'mips', u'chamber', u'region',
        u'root', u'inhibition', u'hydraulic', u'protoplast', u'fraction',
        u'decreased', u'seedling', u'reduced', u'condition', u'stomatal'], 
       dtype='<U19'),
 33: array([u'pump', u'crt', u'atpases', u'cadpr', u'release', u'terminus',
        u'altered', u'location', u'localization', u'member', u'metal',
        u'phosphorylation', u'domain', u'cam', u'yeast', u'fluorescence',
        u'peptide', u'heavy', u'medium', u'arabidopsis'], 
       dtype='<U19'),
 34: array([u'pme', u'cedar', u'yellow', u'megagametophyte', u'breakage',
        u'weakening', u'cap', u'seed', u'endosperm', u'radicle',
        u'emergence', u'germination', u'dormancy', u'mrna', u'lateral',
        u'isoforms', u'tissue', u'expressed', u'tomato', u'wall'], 
       dtype='<U19'),
 35: array([u'agrobacterium', u'prior', u'gynoecium', u'procambial', u'ovule',
        u'transformation', u'vascular', u'division', u'method', u'family',
        u'transformants', u'differentiation', u'gu', u'pollen', u'plastid',
        u'xylem', u'plant', u'level', u'transgenic', u'protein'], 
       dtype='<U19'),
 36: array([u'pro', u'glu', u'via', u'throughout', u'catabolism', u'feedback',
        u'berry', u'deposition', u'orn', u'rna', u'expressing', u'psi',
        u'low', u'labeling', u'tip', u'role', u'regulation', u'synthesis',
        u'stress', u'biosynthesis'], 
       dtype='<U19'),
 37: array([u'wheat', u'sbei', u'gbssii', u'sbeii', u'triticum', u'sbeic',
        u'ssiii', u'motif', u'repeat', u'endosperm', u'gbssi', u'sbeiia',
        u'granule', u'exon', u'starch', u'produced', u'cdna', u'encoding',
        u'sequence', u'region'], 
       dtype='<U19'),
 38: array([u'snare', u'dynamic', u'network', u'rop', u'carrot', u'vacuole',
        u'polypeptide', u'tonoplast', u'number', u'vesicle', u'signal',
        u'development', u'membrane', u'arabidopsis', u'gene', u'activity',
        u'root', u'leaf', u'expression', u'mutant'], 
       dtype='<U19'),
 39: array([u'adh', u'acclimation', u'extent', u'hypoxic', u'balance',
        u'oxygen', u'hypoxia', u'anoxia', u'survival', u'invertase',
        u'shoot', u'induction', u'addition', u'suc', u'ethylene', u'carbon',
        u'tip', u'sugar', u'signaling', u'protein'], 
       dtype='<U19'),
 40: array([u'along', u'alga', u'limiting', u'acetabulum', u'complementation',
        u'portion', u'middle', u'stalk', u'apical', u'rhizoid',
        u'acclimation', u'adult', u'reinhardtii', u'group', u'amino',
        u'signal', u'purified', u'carbon', u'low', u'change'], 
       dtype='<U19'),
 41: array([u'internode', u'dpy', u'brassinosteroid', u'elongating', u'xtr',
        u'brassinolide', u'xet', u'xyloglucan', u'hypocotyl', u'four',
        u'auxin', u'expression', u'mutant', u'reduced', u'rice', u'tomato',
        u'gene', u'elongation', u'expressed', u'mrna'], 
       dtype='<U19'),
 42: array([u'monoterpene', u'peppermint', u'gland', u'monoterpenes',
        u'peltate', u'ultrastructure', u'glandular', u'age', u'oil',
        u'stalk', u'secretory', u'biosynthesis', u'defense', u'enzyme',
        u'biosynthetic', u'initiation', u'process', u'leaf', u'rate',
        u'accumulation'], 
       dtype='<U19'),
 43: array([u'gbssii', u'wheat', u'ssiii', u'antiserum', u'zssi', u'motif',
        u'eliminated', u'soluble', u'repeat', u'endosperm', u'gbssi',
        u'granule', u'starch', u'exon', u'encoding', u'extract', u'region',
        u'tissue', u'domain', u'cdna'], 
       dtype='<U19'),
 44: array([u'asa', u'gldh', u'embryonic', u'ax', u'incorporation',
        u'ascorbate', u'pool', u'glc', u'mitochondrial', u'mitochondrion',
        u'electron', u'accumulation', u'membrane', u'activity', u'cell',
        u'plant', u'protein', u'gene', u'root', u'leaf'], 
       dtype='<U19'),
 45: array([u'bvr', u'phytochromobilin', u'pchlide', u'heme', u'tetrapyrrole',
        u'ala', u'line', u'family', u'phenotype', u'inhibition',
        u'synthesis', u'mutant', u'arabidopsis', u'cell', u'protein',
        u'activity', u'root', u'leaf', u'expression', u'plant'], 
       dtype='<U19'),
 46: array([u'pld', u'superoxide', u'oleate', u'alpha', u'gamma', u'effective',
        u'generation', u'subcellular', u'delta', u'molecular', u'fraction',
        u'production', u'organ', u'beta', u'binding', u'lipid', u'mutation',
        u'acid', u'leaf', u'activity'], 
       dtype='<U19'),
 47: array([u'apase', u'apa', u'phosphorus', u'phosphate', u'phosphodiesterase',
        u'phi', u'mustard', u'rils', u'common', u'indian', u'nmp',
        u'significant', u'white', u'starvation', u'secreted',
        u'phosphatase', u'extracellular', u'nucleotide', u'promoter',
        u'deficiency'], 
       dtype='<U19'),
 48: array([u'senescence', u'death', u'pcd', u'sark', u'peptidase', u'ecm',
        u'sag', u'te', u'clpp', u'mmps', u'postharvest', u'petal',
        u'rupture', u'floret', u'head', u'trap', u'programmed', u'animal',
        u'rnase', u'biochemical'], 
       dtype='<U19'),
 49: array([u'dbe', u'pullulanase', u'isoamylase', u'material', u'dbes',
        u'gbssi', u'amylopectin', u'polysaccharide', u'endosperm', u'null',
        u'mutation', u'kernel', u'presence', u'allele', u'starch',
        u'identified', u'result', u'developing', u'enzyme', u'effect'], 
       dtype='<U19'),
 50: array([u'cryoprotectin', u'cryoprotective', u'smhsps', u'cortical',
        u'parenchyma', u'freezing', u'enhanced', u'transfer', u'fraction',
        u'tolerance', u'acclimation', u'presence', u'winter', u'cold',
        u'low', u'change', u'lipid', u'temperature', u'protein',
        u'accumulation'], 
       dtype='<U19'),
 51: array([u'shi', u'ostmk', u'oskapp', u'lermax', u'gse', u'grd', u'class',
        u'erecta', u'dwarf', u'show', u'kinase', u'elongation', u'rice',
        u'allele', u'domain', u'effect', u'response', u'mutant', u'growth',
        u'cell'], 
       dtype='<U19'),
 52: array([u'hosak', u'calcineurin', u'sipk', u'nacl', u'kinase', u'involved',
        u'activation', u'stress', u'yeast', u'salt', u'identified',
        u'phenotype', u'protein', u'arabidopsis', u'gene', u'two',
        u'expression', u'seed', u'tobacco', u'mutant'], 
       dtype='<U19'),
 53: array([u'psba', u'redox', u'psae', u'state', u'crhr', u'dtt', u'orange',
        u'dcmu', u'electron', u'carrier', u'light', u'gu', u'photosystem',
        u'transcript', u'degradation', u'regulated', u'transcription',
        u'rna', u'accumulation', u'blue'], 
       dtype='<U19'),
 54: array([u'bound', u'gave', u'information', u'mass', u'data', u'protein',
        u'guard', u'germination', u'isoforms', u'flower', u'chloroplast',
        u'embryo', u'gene', u'barley', u'expressed', u'two', u'expression',
        u'response', u'arabidopsis', u'plant'], 
       dtype='<U19'),
 55: array([u'pro', u'glu', u'patatin', u'throughout', u'via', u'catabolism',
        u'regulated', u'feedback', u'berry', u'deposition', u'orn',
        u'carbohydrate', u'induction', u'rna', u'sugar', u'expressing',
        u'psi', u'low', u'promoter', u'labeling'], 
       dtype='<U19'),
 56: array([u'nitrite', u'hat', u'excreted', u'nitrate', u'gln', u'glu',
        u'transporter', u'influx', u'abundance', u'transcript', u'amino',
        u'mutant', u'system', u'uptake', u'decreased', u'nitrogen', u'high',
        u'effect', u'acid', u'gene'], 
       dtype='<U19'),
 57: array([u'floral', u'later', u'tsam', u'partner', u'flower', u'residue',
        u'orchid', u'meristem', u'transition', u'interaction', u'subfamily',
        u'temulentum', u'expressed', u'within', u'stage', u'development',
        u'showed', u'induction', u'gene', u'region'], 
       dtype='<U19'),
 58: array([u'sbei', u'sbeii', u'sbeic', u'triticum', u'beiib', u'sbeiia',
        u'wheat', u'produced', u'endosperm', u'sequence', u'transcription',
        u'chain', u'maize', u'mutation', u'cdna', u'gene', u'expression',
        u'plant', u'root', u'leaf'], 
       dtype='<U19'),
 59: array([u'sexta', u'facs', u'pda', u'coronatine', u'biological',
        u'volatile', u'fa', u'endogenous', u'active', u'type',
        u'biosynthesis', u'pattern', u'signal', u'increase', u'transcript',
        u'pathway', u'acid', u'response', u'activity', u'cell'], 
       dtype='<U19'),
 60: array([u'gravistimulation', u'columella', u'phc', u'pulvinus', u'nodal',
        u'insp', u'pulvini', u'tier', u'positional', u'central',
        u'gravitropic', u'half', u'oat', u'peripheral', u'change',
        u'amyloplasts', u'unit', u'form', u'domain', u'cell'], 
       dtype='<U19'),
 61: array([u'apx', u'photooxidative', u'phl', u'ndh', u'ndhf', u'isoenzyme',
        u'lignification', u'isoenzymes', u'ndhb', u'cat', u'stress',
        u'oxidative', u'transcript', u'polypeptide', u'peroxidase',
        u'system', u'induction', u'treatment', u'complex', u'level'], 
       dtype='<U19'),
 62: array([u'villin', u'atvln', u'atvlns', u'phaseolus', u'high', u'transgene',
        u'transgenic', u'expression', u'sequence', u'seed', u'protein',
        u'plant', u'level', u'gene', u'cell', u'activity', u'root', u'leaf',
        u'mutant', u'acid'], 
       dtype='<U19'),
 63: array([u'fructan', u'transferase', u'chicory', u'iia', u'preparation',
        u'corresponding', u'cdna', u'phloem', u'production',
        u'concentration', u'enzyme', u'acid', u'cell', u'plant', u'protein',
        u'gene', u'root', u'leaf', u'expression', u'mutant'], 
       dtype='<U19'),
 64: array([u'ptdins', u'cadpr', u'release', u'calcium', u'animal',
        u'signaling', u'osmotic', u'stress', u'increase', u'response',
        u'cell', u'gene', u'protein', u'activity', u'root', u'leaf',
        u'expression', u'mutant', u'plant', u'level'], 
       dtype='<U19'),
 65: array([u'indole', u'iaa', u'novo', u'vitro', u'pool', u'conjugate',
        u'synthesis', u'trp', u'regulation', u'maize', u'acid', u'cell',
        u'plant', u'protein', u'gene', u'activity', u'root', u'leaf',
        u'expression', u'mutant'], 
       dtype='<U19'),
 66: array([u'adh', u'hypoxia', u'spaceflight', u'plate', u'agar', u'induces',
        u'terrestrial', u'defective', u'growing', u'induction', u'addition',
        u'medium', u'ethylene', u'expression', u'pattern', u'gene', u'root',
        u'stress', u'mutant', u'arabidopsis'], 
       dtype='<U19'),
 67: array([u'lfy', u'phyb', u'flowering', u'gas', u'etiolated', u'hormone',
        u'transfer', u'level', u'show', u'regulation', u'change',
        u'seedling', u'light', u'potato', u'transcript', u'effect',
        u'accumulation', u'mrna', u'elongation', u'response'], 
       dtype='<U19'),
 68: array([u'radiation', u'screening', u'phenolic', u'uvb', u'heterodimer',
        u'penetration', u'chs', u'ester', u'landsberg', u'sinapate',
        u'solar', u'contribution', u'degradation', u'relative', u'center',
        u'mechanism', u'epidermal', u'reaction', u'increase', u'flavonoid'], 
       dtype='<U19'),
 69: array([u'pin', u'leaos', u'aos', u'opda', u'opdame', u'lehpl', u'acid',
        u'induction', u'expression', u'fatty', u'induced', u'gene',
        u'ethylene', u'level', u'cell', u'plant', u'protein', u'activity',
        u'root', u'leaf'], 
       dtype='<U19'),
 70: array([u'upstream', u'complementation', u'limiting', u'acclimation',
        u'reinhardtii', u'condition', u'region', u'group', u'signal',
        u'low', u'promoter', u'change', u'mutant', u'light', u'growth',
        u'cell', u'plant', u'protein', u'gene', u'activity'], 
       dtype='<U19'),
 71: array([u'cyclins', u'mitotic', u'modulated', u'nicta', u'cyclin', u'cdk',
        u'lyces', u'early', u'dpa', u'phase', u'fruit', u'cycle',
        u'meristem', u'organ', u'development', u'tomato', u'line', u'cell',
        u'tissue', u'activity'], 
       dtype='<U19'),
 72: array([u'stz', u'samt', u'salt', u'regulated', u'identified', u'ion',
        u'uptake', u'gene', u'kinase', u'expression', u'tolerance', u'aba',
        u'stress', u'arabidopsis', u'protein', u'activity', u'cell',
        u'root', u'leaf', u'mutant'], 
       dtype='<U19'),
 73: array([u'sdh', u'faa', u'bifunctional', u'lys', u'asp', u'lkr', u'optimum',
        u'hsdh', u'monofunctional', u'physiological', u'locus', u'value',
        u'feedback', u'thr', u'amino', u'enzyme', u'may', u'inhibition',
        u'two', u'high'], 
       dtype='<U19'),
 74: array([u'parsley', u'phenylpropanoid', u'sinapate', u'ester', u'conserved',
        u'promoter', u'identified', u'intron', u'cdna', u'biosynthesis',
        u'expressed', u'analysis', u'enzyme', u'accumulation', u'cell',
        u'plant', u'protein', u'acid', u'root', u'leaf'], 
       dtype='<U19'),
 75: array([u'cambial', u'latewood', u'vein', u'iaam', u'dormant', u'pmes',
        u'pattern', u'iaa', u'polar', u'isoform', u'distribution',
        u'inhibitor', u'active', u'stage', u'concentration', u'initiation',
        u'carbohydrate', u'formation', u'phloem', u'isoforms'], 
       dtype='<U19'),
 76: array([u'tuber', u'palatinose', u'nadme', u'contained', u'morphology',
        u'sense', u'metabolite', u'potato', u'subcellular', u'metabolized',
        u'starch', u'glycolytic', u'antisense', u'whereas', u'distribution',
        u'method', u'revealed', u'suc', u'effect', u'metabolism'], 
       dtype='<U19'),
 77: array([u'accase', u'bccp', u'database', u'fat', u'thioesterase',
        u'example', u'subunit', u'coa', u'encode', u'oilseed', u'est',
        u'silique', u'present', u'abundance', u'developing', u'time',
        u'product', u'isoforms', u'whereas', u'soybean'], 
       dtype='<U19'),
 78: array([u'ferredoxin', u'pfdiii', u'sulfite', u'sir', u'iii',
        u'desaturases', u'fnr', u'stroma', u'imported', u'heterotrophic',
        u'characteristic', u'desaturase', u'dark', u'precursor', u'import',
        u'reduction', u'affinity', u'envelope', u'isoforms', u'chloroplast'], 
       dtype='<U19'),
 79: array([u'gsts', u'chromatography', u'gst', u'gtases', u'glutathione',
        u'gamma', u'conjugate', u'flavonoid', u'revealed', u'gsh', u'iii',
        u'substrate', u'different', u'treatment', u'sequence', u'type',
        u'maize', u'cdna', u'enzyme', u'plant'], 
       dtype='<U19'),
 80: array([u'discrimination', u'ammonium', u'translocation', u'center',
        u'uninduced', u'influx', u'gln', u'glu', u'exposure', u'effect',
        u'abundance', u'transcript', u'time', u'amino', u'nitrogen',
        u'decreased', u'uptake', u'control', u'induced', u'system'], 
       dtype='<U19'),
 81: array([u'flc', u'photoperiods', u'flowering', u'transition', u'camv',
        u'delayed', u'triple', u'promote', u'vernalization', u'enhancer',
        u'floral', u'long', u'pathway', u'suc', u'sugar', u'concentration',
        u'condition', u'three', u'mutant', u'effect'], 
       dtype='<U19')}
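
Several clusters in the output above share many of their top keywords (compare clusters 36 and 55, or 37, 43 and 58). As a rough sketch of how such overlap could be measured, the snippet below computes the Jaccard similarity between keyword lists and reports pairs above a threshold. The names `cluster_keywords`, `jaccard`, and `similar_clusters`, and the threshold of 0.5, are assumptions made for illustration and are not part of the notebook. Only the first ten keywords of three clusters are copied here to keep the example short.

```python
from itertools import combinations

# Excerpt of the output above: first ten keywords of clusters 25, 36 and 55.
# The variable name `cluster_keywords` is hypothetical.
cluster_keywords = {
    25: ['csi', 'polar', 'iba', 'auxin', 'npa', 'vein', 'toward',
         'gravity', 'iaa', 'inflorescence'],
    36: ['pro', 'glu', 'via', 'throughout', 'catabolism', 'feedback',
         'berry', 'deposition', 'orn', 'rna'],
    55: ['pro', 'glu', 'patatin', 'throughout', 'via', 'catabolism',
         'regulated', 'feedback', 'berry', 'deposition'],
}


def jaccard(a, b):
    """Return |A & B| / |A | B| for two keyword collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def similar_clusters(keywords, threshold=0.5):
    """List (cluster_i, cluster_j, score) for pairs at or above the threshold."""
    pairs = []
    for (i, kw_i), (j, kw_j) in combinations(sorted(keywords.items()), 2):
        score = jaccard(kw_i, kw_j)
        if score >= threshold:
            pairs.append((i, j, score))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])


overlaps = similar_clusters(cluster_keywords)
for i, j, score in overlaps:
    print('clusters %d and %d: Jaccard = %.2f' % (i, j, score))
```

Applied to the full dictionary, a check like this might help decide whether near-duplicate clusters should be merged or labelled together before interpreting them. The appropriate threshold is a judgment call and would need to be tuned against the data.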